18 research outputs found

    Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled

    Get PDF
    In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch decreases recognition accuracy. If it is unpredictable what kind of data can be expected, or in other words, if the conditions of the audio to be processed are unknown, it is impossible to tune the models, and if the material consists of 'surprise data' the output of the system is likely to be poor. In this thesis, methods are presented that require no external training data for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition. The speech/non-speech classification subsystem separates speech from silence and from unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high-quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, therefore does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using this bootstrap classification.
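    The bootstrap idea described above can be sketched in a few lines: obtain rough speech/silence labels with a generic detector, then train class models on the target audio itself and realign. Below is a minimal sketch with 1-D log-energy features and a single Gaussian per class; the function and parameter names are ours, not SHoUT's:

```python
import numpy as np

def bootstrap_sad(frames, n_iter=3):
    """Bootstrap speech activity detection in the spirit of the approach
    above: no external training data, the class models are trained on the
    target audio itself.  `frames` is a 1-D array of per-frame log
    energies (a stand-in for real acoustic features)."""
    # Bootstrap speech/silence labels from a simple energy threshold,
    # playing the role of the standard speech activity component.
    labels = (frames > np.median(frames)).astype(int)  # 1 = speech
    for _ in range(n_iter):
        # Train one Gaussian per class on the target audio itself.
        mu = np.array([frames[labels == c].mean() for c in (0, 1)])
        sd = np.array([frames[labels == c].std() + 1e-6 for c in (0, 1)])
        # Realign: assign each frame to the more likely class.
        ll = -0.5 * ((frames[:, None] - mu) / sd) ** 2 - np.log(sd)
        labels = ll.argmax(axis=1)
    return labels
```

    On real audio one would use spectral features and Gaussian mixture models, and add a third model for audible non-speech once the bootstrap classification provides enough frames for it.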
    This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know which kinds of sound are present in the audio recording. Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording; by iteratively realigning the data, retraining the models and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering. This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because the algorithm is of complexity O(n³), this clustering method is slow for long recordings. Two variations of the subsystem are presented that reduce the required computational effort, so that the subsystem is applicable to long audio recordings as well. The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection of known methods for robust automatic speech recognition is applied and evaluated in this thesis. The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks.
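    The agglomerative clustering step can be illustrated with a toy ΔBIC-based merger: start from uniform chunks of the recording, repeatedly merge the pair of clusters whose merge most decreases the BIC, and stop when no merge helps. This is a strong simplification (1-D features, single Gaussians, no realignment between merges); all names are illustrative, not the SHoUT implementation:

```python
import numpy as np

def diarize(features, init_chunks=8):
    """Toy agglomerative speaker clustering: uniform initial chunks,
    greedy ΔBIC merging, BIC-based stopping criterion."""
    clusters = list(np.array_split(features, init_chunks))

    def dbic(a, b):
        # ΔBIC for merging two single-Gaussian clusters (lambda = 1):
        # log-likelihood loss of the merge minus the parameter penalty.
        ab = np.concatenate([a, b])
        score = (len(ab) * np.log(ab.std() + 1e-9)
                 - len(a) * np.log(a.std() + 1e-9)
                 - len(b) * np.log(b.std() + 1e-9))
        return score - 0.5 * 2 * np.log(len(features))

    while len(clusters) > 1:
        pairs = [(dbic(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > 0:          # no merge improves BIC: stop
            break
        clusters[i] = np.concatenate([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

    The cubic cost mentioned above is visible here: each of up to O(n) merge steps scores O(n²) candidate pairs.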
The diarization subsystem has been evaluated at the NIST RT06s benchmark and the speech activity detection subsystem has been tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
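    The language model look-ahead on a fixed pronunciation prefix tree can be illustrated with unigram probabilities: every tree node stores the best LM score of any word still reachable below it, so the decoder can prune hypotheses before the word identity is known. A minimal sketch, not the SHoUT decoder (the thesis method additionally handles full n-gram contexts on the fixed tree):

```python
def build_prefix_tree(lexicon):
    """Build a pronunciation prefix tree with unigram LM look-ahead.
    `lexicon` maps each word to (phone sequence, unigram probability);
    each node's "la" field holds the best probability of any word that
    is still reachable through that node."""
    root = {"children": {}, "word": None, "la": 0.0}
    for word, (phones, p) in lexicon.items():
        node = root
        node["la"] = max(node["la"], p)
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "word": None, "la": 0.0})
            # Look-ahead: best LM score of any word below this node.
            node["la"] = max(node["la"], p)
        node["word"] = word   # leaf of this pronunciation
    return root
```

    With higher-order language models the look-ahead values depend on the LM history, so they must be recomputed or cached per context; doing that efficiently on a fixed tree is the problem the thesis addresses.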

    Towards Affordable Disclosure of Spoken Word Archives

    Get PDF
    This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of the World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting, e.g., within-document search – are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory and requires additional research.

    Diarization-Based Speaker Retrieval for Broadcast Television Archives

    Get PDF
    Contains fulltext: 94367.pdf (author's version) (Open Access). Interspeech 2011, 27 August 2011

    Unsupervised Acoustic Sub-word Unit Detection for Query-by-example Spoken Term Detection

    Get PDF
    Contains fulltext: 94769.pdf (publisher's version) (Open Access). IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 22 May 201

    Speaker Diarization Error Analysis Using Oracle Components

    Get PDF
    Contains fulltext: 94680pre.pdf (preprint version) (Open Access). Contains fulltext: 94680pub.pdf (publisher's version) (Open Access). 11 p.

    Large Scale Speaker Diarization for Long Recordings and Small Collections

    No full text

    Speaker diarization using gesture and speech

    Get PDF
    Contains fulltext: 134621.pdf (preprint version) (Open Access). Interspeech 2014: 15th Annual Conference of the International Speech Communication Association, 14-18 Sept 2014, MAX Atria @ Singapore EXP

    The majority wins: a method for combining speaker diarization systems

    Get PDF
    Contains fulltext: 91349.pdf (author's version) (Open Access). In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting scheme. The voting scheme selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings the voting method improves our system on all evaluation conditions. For the single distant microphone condition, DER performance improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition the
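    The frame-level majority vote can be sketched as follows, assuming the speaker labels of the different systems have already been mapped to a common label space (the alignment of label spaces across systems is omitted here; names are illustrative):

```python
from collections import Counter

def majority_vote(segmentations):
    """Combine multiple diarization outputs by frame-level majority
    voting.  Each input is a list of per-frame speaker labels of equal
    length; the output keeps, per frame, the label most systems agree
    on."""
    return [Counter(frame).most_common(1)[0][0]
            for frame in zip(*segmentations)]
```

    With an even number of systems, or frames where every system disagrees, a tie-breaking rule (e.g. preferring the historically best system) would be needed on top of this.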

    Speech overlap detection in a two-pass speaker diarization system

    Get PDF
    Contains fulltext: 91347.pdf (author's version) (Open Access). In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations.
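    The two uses of the overlap model described above can be sketched at the frame level: overlap frames are excluded when collecting per-speaker training frames, and emitted with two speaker labels in the final segmentation. The `second_best` input, standing in for a runner-up speaker hypothesis per frame, and all other names here are illustrative, not the RT09s system's interfaces:

```python
def apply_overlap(labels, overlap, second_best):
    """Given per-frame speaker labels, a binary overlap mask and a
    runner-up speaker per frame: (1) collect non-overlap frame indices
    per speaker for model training, (2) emit two speakers per frame in
    overlap regions of the final segmentation."""
    train = {}    # speaker -> frame indices usable for model training
    final = []    # per-frame set of active speakers
    for idx, (lab, ov, alt) in enumerate(zip(labels, overlap, second_best)):
        if ov:
            final.append({lab, alt})          # two speakers active
        else:
            train.setdefault(lab, []).append(idx)
            final.append({lab})
    return train, final
```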

    Semi-automatic labeling of the UCU accents speech corpus

    Get PDF
    Contains fulltext: 135154.pdf (publisher's version) (Open Access). Ninth International Conference on Language Resources and Evaluation, Reykjavik, 26 May 201